1 Time series in the tidyverse

In R, most time series analysis pipelines use objects of the ts class from the stats package. This is standard base R, and it is still the most common way of analysing time series data in R, probably because no better alternative existed until relatively recently.

So, what’s not so great about base R time series analysis?

  1. The original ts data structure doesn’t follow the principles of tidy data.
  2. It often stores data in wide format.
  3. It has unnatural indexing (i.e. values are addressed by their position in the table rather than by variable name).

Together, these make ts objects not only difficult to analyse, but also near impossible to integrate with tidyverse functions.
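To make the contrast concrete, here is a minimal sketch that takes a base R ts object and converts it to the tidy structure introduced later in these notes. AirPassengers is a monthly ts dataset that ships with base R; it is not part of the original notes and is used here purely for illustration.

```r
library(tsibble)

# AirPassengers: monthly airline passenger totals, stored as a base R ts object
class(AirPassengers)
#> [1] "ts"

# values are addressed by position, not by a named time column
AirPassengers[1:3]

# as_tsibble() converts the ts object into a long, tidy table
# with an explicit time column alongside the values
air <- as_tsibble(AirPassengers)
air
```

Printing air shows one row per month, with the time stored in its own column rather than implied by position.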

Enter the tidyverse time series packages, which the originators (jokingly) call the tidyverts (tidy time series)!



2 tsibbledata

The tsibbledata package provides a diverse collection of datasets for learning how to work with tidy time series data. There are 12 time series datasets included in the package, each of which features unique time series characteristics or structures. These datasets are stored as tsibble objects (time series tibbles), which makes time series data compatible with the tidyverse.

library(tsibbledata)

Here are all the datasets contained in this package:

ansett
vic_elec
gafa_stock
aus_livestock
aus_production
aus_retail
global_economy
hh_budget
nyc_bikes
olympic_running
PBS
pelt

They cover a diverse range of time series patterns including regular and irregular data; non-seasonal, seasonal and multi-seasonal series; frequent (30 minutes) and infrequent (4 years) observations. Some data contain nested and crossed structures, while others have relationships between variables or series.

If your data isn’t already in tsibble format, you can easily convert a tibble to a tsibble. Tibble format isn’t bad, but tidy time series analysis is much easier with a tsibble. Two functions in the tsibble package allow us to do this:

  • tsibble() creates a tsibble object
  • as_tsibble() coerces other objects to a tsibble
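As a rough sketch of both functions, the example below uses small made-up tables (the sales and temps data, and their column names, are hypothetical and exist only for illustration). The key and index arguments are explained in the tsibble section of these notes.

```r
library(tibble)
library(tsibble)

# hypothetical data: three days of sales for two stores
sales <- tibble(
  date   = rep(as.Date("2020-01-01") + 0:2, times = 2),
  store  = rep(c("A", "B"), each = 3),
  amount = c(10, 12, 9, 20, 18, 25)
)

# as_tsibble() coerces an existing tibble: we name the time column (index)
# and the column that identifies each separate series (key)
sales_ts <- as_tsibble(sales, key = store, index = date)
sales_ts

# tsibble() builds a tsibble directly from column vectors
temps <- tsibble(
  date  = as.Date("2020-01-01") + 0:2,
  temp  = c(21.5, 23.1, 19.8),
  index = date
)
temps
```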

3 tsibble

The tsibble package by Earo Wang provides a tidy data structure for time series. Built on top of the tibble, a tsibble (or tbl_ts) follows the tidy data principles. A tsibble is essentially a tibble which preserves the time index alongside the data. There are two extra contextual semantics you need to know about to make effective use of tsibbles: key and index.

  • index: this identifies the time component of the data. Usually this will be a date or datetime.
  • key: keys are used within a tsibble to uniquely identify rows belonging to particular time series. So, in this way, a tsibble can hold information on multiple time series.

Each observation in a tsibble must be uniquely identified by a combination of key and index.
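If a table breaks this rule, it can't be converted to a tsibble until the duplicates are dealt with. The sketch below uses a small hypothetical table (sales_dup is made up for illustration) together with the tsibble helpers is_duplicated() and duplicates() to find the offending rows.

```r
library(tibble)
library(tsibble)

# hypothetical data: store "A" has two rows for 2020-01-02
sales_dup <- tibble(
  date   = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02")),
  store  = c("A", "A", "A"),
  amount = c(10, 12, 13)
)

# TRUE if any key/index combination occurs more than once
has_dups <- is_duplicated(sales_dup, key = store, index = date)
has_dups

# list the rows involved, so they can be fixed before calling as_tsibble()
dup_rows <- duplicates(sales_dup, key = store, index = date)
dup_rows
```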


Let’s load the necessary libraries and have a look at one of the datasets in the tsibbledata package:

library(tsibble)
library(tsibbledata)
head(vic_elec)

Now let’s see the index, key and interval for this tsibble:

index_var(vic_elec)
## [1] "Time"
key_vars(vic_elec)
## character(0)
interval(vic_elec)
## <interval[1]>
## [1] 30m

So the index is the Time variable, and there is no key (which essentially means that this tsibble holds just a single time series). The interval is reported as ‘30m’, which means observations are spaced 30 minutes apart.


3.0.1 tsibble and dplyr

Now we’ll start using regular tidyverse methods on tsibbles: this is the main advantage of tsibble over the older ts objects.

Row-wise verbs, for example filter() and arrange(), work for a tsibble in exactly the same way as for a general tibble. However, column-wise verbs behave differently by design: as one important example, the index column cannot be dropped from a tsibble.

So you do need to keep your index, but you can select() different columns in the usual way:

vic_elec %>%
  select(Time, Demand, Temperature)

Note that if you do try to drop the index, the tsibble will retain it by default:

# Time can't be dropped
vic_elec %>%
  select(Demand, Temperature)
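A quick way to confirm this is to compare column names. The sketch below also shows one possible escape hatch: converting to a plain tibble with as_tibble() first, after which the index is no longer protected (and the result is no longer a tsibble).

```r
library(dplyr)
library(tsibble)
library(tsibbledata)

# select() on a tsibble keeps the index even though we didn't ask for it
kept <- vic_elec %>%
  select(Demand, Temperature)
names(kept)

# converting to a plain tibble first removes that protection
dropped <- vic_elec %>%
  as_tibble() %>%
  select(Demand, Temperature)
names(dropped)
```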

You can also filter() rows as you would normally:

vic_elec %>%
  filter(Holiday == TRUE)

And you can also extract a section of a time series using the filter_index() function:

# let's get January 2012
vic_elec %>%
  filter_index("2012-01")
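filter_index() also accepts ranges written as formulas, with start ~ end. The following is a small sketch of the variations on vic_elec; the object names are just for illustration.

```r
library(tsibble)
library(tsibbledata)

# January to March 2012 (both ends inclusive)
jan_mar <- vic_elec %>%
  filter_index("2012-01" ~ "2012-03")

# everything up to and including June 2012
first_half <- vic_elec %>%
  filter_index(~ "2012-06")

# two separate periods in one call
two_januaries <- vic_elec %>%
  filter_index("2012-01", "2013-01")

range(jan_mar$Date)
```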


In terms of adding columns, you can mutate() as normal:

library(lubridate)

vic_year <- vic_elec %>%
  mutate(year = lubridate::year(Date))

vic_year


And you can summarise() in the same way you would with a normal tibble, although instead of group_by(), you should generally use index_by(). This is the counterpart of group_by() in a temporal context, but it only groups the time index. You need this extra function because the index variable in a tsibble has to stay, but you might not always want your data summarised over such fine intervals.

For example, using group_by() on vic_year doesn’t behave as you might expect: the Time variable is maintained, so you still get one row per half hour rather than one row per year:

vic_year %>%
  select(year, Temperature) %>%
  group_by(year) %>%
  summarise(mean_temp = mean(Temperature))


However, if you use index_by(), it will group rows in the way you want:

vic_year %>%
  select(year, Temperature) %>%
  index_by(year) %>%
  summarise(mean_temp = mean(Temperature))

You can also aggregate data to different time intervals using index_by(). For example, here are the mean temperatures for each day:

vic_year %>%
  index_by(date = as_date(Time)) %>%
  summarise(mean_temp = mean(Temperature))

3.0.2 tsibble and ggplot2

For visualisation, ggplot2 functions also work straightforwardly with tsibble data:

library(ggplot2)

# plot temperature over time
vic_elec %>%
  filter(Holiday == FALSE) %>%
  ggplot() + 
  geom_line(aes(x = Time, y = Temperature))

We can see that this data is collected on a half-hourly basis, so there are a lot of data points to show. Aggregating to a coarser interval might lead to a more informative plot.

We saw briefly above that we can aggregate using the index_by() function. The lubridate functions floor_date(), ceiling_date() and round_date() are helpful for all sorts of sub-daily rounding, while as_date() and year() can help transform to daily and annual values respectively. Finally, yearweek(), yearmonth() and yearquarter() in the tsibble package cover other time ranges.

For example, if we wanted to summarise the info by date (rather than every half hour), we could do the following:

# aggregate by date
elec_date <- vic_elec %>%
  index_by(date = as_date(Time)) %>%
  summarise(temp_mean = mean(Temperature, na.rm = TRUE))

elec_date
# make a plot
elec_date %>%
  ggplot(aes(x = date, y = temp_mean)) +
  geom_line()


If we wanted to summarise by month, rather than date, we could do this:

elec_month <- vic_elec %>%
  index_by(month = month(Time, label = TRUE)) %>%
  summarise(temp_mean = mean(Temperature, na.rm = TRUE))

elec_month
# make a plot
elec_month %>%
  ggplot(aes(x = month, y = temp_mean)) +
  geom_point() + 
  geom_line(group = 1)


And if we wanted to summarise by year, rather than month, we could do this:

elec_year <- vic_elec %>%
  index_by(year = year(Time)) %>%
  summarise(temp_mean = mean(Temperature, na.rm = TRUE))

elec_year
# make a plot
elec_year %>%
  ggplot(aes(x = year, y = temp_mean)) +
  geom_col(fill = "steelblue", alpha = 0.7) + 
  ylab("Mean Temperature")+ 
  xlab("year")
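Note that month(Time, label = TRUE) pools the same calendar month across all years. As a sketch of the other helpers mentioned earlier, yearmonth() keeps each month of each year separate, and floor_date() handles sub-daily blocks. The object names and the 4-hour block size are arbitrary choices for illustration.

```r
library(dplyr)
library(lubridate)
library(tsibble)
library(tsibbledata)
library(ggplot2)

# yearmonth() keeps the year, so January 2012 and January 2013 are separate groups
elec_yearmonth <- vic_elec %>%
  index_by(month = yearmonth(Time)) %>%
  summarise(temp_mean = mean(Temperature, na.rm = TRUE))

# floor_date() rounds each timestamp down, here to the start of its 4-hour block
elec_4h <- vic_elec %>%
  index_by(block = floor_date(Time, "4 hours")) %>%
  summarise(temp_mean = mean(Temperature, na.rm = TRUE))

# the monthly series now runs continuously over the whole period
ggplot(elec_yearmonth, aes(x = month, y = temp_mean)) +
  geom_line()
```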

3.0.3 Rolling windows

In addition to aggregation, calculating so-called ‘rolling window’ statistics is a common technique in time series analysis. Rolling window calculations apply summarising functions over a moving ‘time window’ on your data (i.e. a moving subset of the data).

There are a few common reasons you may want to use rolling calculations in time series analysis:

  • Measuring how central tendency varies over time, e.g. mean() and median()
  • Measuring how volatility changes over time, e.g. sd() and var()
  • Detecting changes in trend (fast vs slow moving averages)
  • Measuring how a relationship between two time series varies over time, e.g. cor() and cov()

The slider package provides functions allowing for different variations of rolling windows using purrr-like syntax. The most common example of a rolling window calculation is a moving average, and so we will focus on this as an example.

A moving average allows us to visualize how an average changes over time, which is very useful in cutting through the noise to detect a trend in a time series. By varying the window width (i.e. the number of observations included in the rolling calculation), we can vary the sensitivity of the window calculation.
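Before applying this to real data, it can help to see the mechanics on a tiny vector. The sketch below uses slide_dbl() from the slider package on five made-up numbers; .before and .after set how many neighbouring points on each side are included in the window.

```r
library(slider)

x <- c(1, 2, 3, 4, 5)

# window = previous point, current point, next point (width 3)
# at the edges the window is simply shorter
narrow <- slide_dbl(x, mean, .before = 1, .after = 1)
narrow
#> [1] 1.5 2.0 3.0 4.0 4.5

# a wider window (width 5) averages over more neighbours
wide <- slide_dbl(x, mean, .before = 2, .after = 2)
wide
#> [1] 2.0 2.5 3.0 3.5 4.0
```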



Let’s see an example of a moving average using a window 2001 points wide, corresponding to the current point, plus the 1000 points before and the 1000 points after:

library(slider)
## 
## Attaching package: 'slider'
## The following objects are masked from 'package:tsibble':
## 
##     pslide, pslide_chr, pslide_dbl, pslide_dfc, pslide_dfr, pslide_int,
##     pslide_lgl, slide, slide_chr, slide_dbl, slide_dfc, slide_dfr,
##     slide_int, slide_lgl, slide2, slide2_chr, slide2_dbl, slide2_dfc,
##     slide2_dfr, slide2_int, slide2_lgl
# calculate a rolling window
# the . in .f corresponds to the current window in the data, i.e. each subset
elec_rolling <- vic_elec %>% 
  mutate(
    temp_moving_avg = slide_dbl(
      .x = Temperature, 
      .f = ~ mean(., na.rm = TRUE),
      .before = 1000,
      .after = 1000
    )
  )

elec_rolling
# plot the data 
# use the half-hourly Time index on the x axis (Date would stack 48 readings per day)
ggplot(elec_rolling) + 
  geom_line(aes(x = Time, y = Temperature), colour = "grey") + 
  geom_line(aes(x = Time, y = temp_moving_avg), colour = "red")


Task - 5 mins

The slide_dbl() function makes a number of arguments available, so let’s play around with them a little!

  • Try setting .before and .after to 100. How does the plot change?
  • What happens if you set .complete = TRUE?

Solution

  • With .before and .after set to 100, the plot becomes more ‘jagged’, as the mean ‘sees’ more of the short-term variation in the data.
  • If you set .complete = TRUE, the rolling average ‘waits’ until it has seen a complete range of before and after points before it starts reporting values, so pieces of the line are missing at the start (.before is incomplete here) and at the end (.after is incomplete here).
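The other uses of rolling windows listed earlier follow the same pattern. As a rough sketch, the code below computes a rolling standard deviation (volatility) and a rolling correlation between two variables using slide_dbl() and slide2_dbl(). The one-day trailing window of 48 half-hourly points and the restriction to January 2012 are arbitrary choices made to keep the example small.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)
library(slider)

elec_jan <- vic_elec %>%
  filter_index("2012-01") %>%
  mutate(
    # volatility: sd of temperature over the current point plus the 47 before it
    temp_sd = slide_dbl(Temperature, sd, .before = 47, .complete = TRUE),
    # relationship: correlation of demand and temperature over the same window
    demand_temp_cor = slide2_dbl(
      .x = Demand,
      .y = Temperature,
      .f = ~ cor(.x, .y),
      .before = 47,
      .complete = TRUE
    )
  )

elec_jan %>%
  select(Time, temp_sd, demand_temp_cor)
```

Because .complete = TRUE is set, the first 47 rows of both new columns are NA.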

While the tsibble package provides more functions for you to use, these are the main ones you’ll turn to when exploring time series data.


4 feasts

The feasts package stands for: feature extraction and statistics for time series. It works with tidy temporal data provided by the tsibble package, and produces time series features, decompositions, statistical summaries and visualisations.

Let’s look at the tourism data, which ships with the tsibble package:

library(feasts)
tourism

Let’s check the index and key of the tsibble:

index_var(tourism)
## [1] "Quarter"
key_vars(tourism)
## [1] "Region"  "State"   "Purpose"

What are the available levels of Purpose?

tourism %>%
  distinct(Purpose)

This dataset is already in tsibble format, so we can use tidyverse functions on it. For example, let’s filter() down to holiday and business trips, group_by() State, and then summarise() to calculate the total number of such trips for each state in each quarter.

holidays <- tourism %>%
  filter(Purpose %in% c("Holiday","Business")) %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

holidays
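The keys determine how many separate series a tsibble holds, and summarise() always keeps the index. As a small sketch of this, calling summarise() without group_by() adds up across every key and leaves a single series with one row per quarter (the name national is just for illustration).

```r
library(dplyr)
library(tsibble)

# tourism has three key variables, so it holds many separate series
n_keys(tourism)

# summarise() without group_by() sums across all keys,
# leaving one series indexed by Quarter
national <- tourism %>%
  summarise(Trips = sum(Trips))

key_vars(national)
n_keys(national)
national
```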

4.1 Autoplot

As we mentioned, visualisation is often the first step in understanding patterns in time series data. We can always build plots by hand with ggplot2, but the feasts package also uses ggplot2 to produce customisable graphics for time series. You can straightforwardly plot your data over time using autoplot(). Note that autoplot() is a generic function from ggplot2 rather than base R; the method for tsibbles becomes available once feasts is loaded, so make sure you have called library(feasts) first.

holidays %>% 
  autoplot(Trips) + 
  xlab("Year")
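For comparison, roughly the same plot can be built by hand with ggplot2. This is only a sketch of an approximate equivalent, not a description of what autoplot() does internally; the block recreates holidays so that it runs on its own.

```r
library(dplyr)
library(ggplot2)
library(tsibble)

# recreate the holidays tsibble from earlier
holidays <- tourism %>%
  filter(Purpose %in% c("Holiday", "Business")) %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

# one line per key (State), with the index (Quarter) on the x axis
p <- ggplot(holidays, aes(x = Quarter, y = Trips, colour = State)) +
  geom_line() +
  xlab("Year")
p
```

Writing the plot out like this makes it easy to add facets, change scales or restyle the lines in the usual ggplot2 way.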


Task - 5 mins

What inferences (if any) can you make about the trends, seasonality and overall patterns of these time series?

Potential Answer

NSW, Queensland and Victoria have the highest numbers of visitors. Overall, there seems to be a gradual upward trend in the data after 2010: visitor numbers are increasing. There also seems to be a seasonal component, but this would need some different plotting techniques to narrow it down.


4.1.1 Seasonal patterns

We saw above that it might be useful to learn how to plot seasonal trends in the data. The feasts package has the gg_season() plot function allowing us to do this. A seasonal plot is similar to a regular time series plot, except the horizontal axis shows data from within each season. This plot type allows the underlying seasonal pattern to be seen more clearly, and is especially useful in identifying years in which the pattern changes.

For this, let’s just look at the top three states.

holidays %>%
  filter(State %in% c("Queensland", "New South Wales", "Victoria")) %>%
  gg_season(Trips)

The seasonal plot (gg_season()) wraps a seasonal period (in this case, quarters) over the x axis, allowing you to see how each quarter varies.

Task - 5 mins

What can you tell from this seasonal plot?

Potential Answers

Summer trips (Sep - March) seem to go down in Victoria, but up in Queensland. Could this be people coming to Queensland for their summer holidays? Overall 2017 seems to have the highest tourist rates. However, this might be easier to see if we split the data out a bit more into subseries.


4.1.2 Subseries

Now we can split our data out into subseries - that is, each plot will show, for one State and one quarter, the number of Trips in successive years. This might make it easier to observe trends over time, as the trend of recent years can also be seen in the spread between the lines.

holidays %>%
  filter(State %in% c("Queensland", "New South Wales", "Victoria")) %>%
  gg_subseries(Trips)


Task - 5 mins

What can you see from these plots regarding the overall trends and seasonality in the three states?

Potential Answers

New South Wales: holiday trips are highest in the first quarter of the year, and almost equal in the other three. There was a big dip in trip numbers between 2005 and 2010, followed by an increase.

Queensland: Trips are higher in the third quarter (winter months). Again, there was a small dip between 2005 and 2010, but trip numbers have been increasing more recently.

Victoria: Holiday trips are again higher in the first quarter of the year (the summer months). The overall trend across the years is again upwards after 2015.